Mining Acknowledgements Sections

Author

Eva Maxfield Brown and Chris Fu

Published

October 28, 2022

Introduction

Literature reviews typically focus on a published work’s vision, content, methods, results, and limitations. We instead aim to find meaningful information in the acknowledgements sections of research papers. An acknowledgements section appears in most research papers but, as far as we know, attracts little interest. We want to understand the different aspects of acknowledgements sections: how they are organized and whether, within a specific field, certain names and entities are mentioned frequently. We also discuss how these findings could be used to present helpful information to readers who use search engines to look for related research.

Original Dataset

The original dataset of 64 papers was provided to us as a large JSON file containing a lot of data. For our analysis of acknowledgements sections, we needed only a few data points to get started. The original dataset is available below for exploration (with a minor change so that it renders nicely).

Show Code for Loading the Original Dataset
from IPython.display import JSON
import json

with open("data/599_lit_review.json", "r") as open_f:
    original_dataset = json.load(open_f)
    
JSON({"data": original_dataset})

Compiled Dataset

For our analysis, we needed only some metadata and a view or download link for each paper. We then manually copied each paper’s acknowledgements section into our dataset (we have some thoughts on how to automate this in a later section).

To extract the data we needed we ran the following code:

Show Code for Compile Dataset for Manual Addition
import pandas as pd

compiled_rows = []
for index, paper in enumerate(original_dataset):
    # Some papers have data from CSL and some from S2
    # Get both so we don't really have to care later on
    
    # Check if the paper has CSL data at all
    if paper.get("csl", None) is not None:
        # Find or get title and url returned by CSL data
        csl_title = paper["csl"].get("title", None)
        csl_url = paper["csl"].get("URL", None)
    else:
        csl_title = None
        csl_url = None

    # Check if the paper has Semantic Scholar data at all
    if paper.get("s2data", None) is not None:
        # Find or get title and url returned by S2 data
        s2_title = paper["s2data"].get("title", None)
        s2_url = paper["s2data"].get("url", None)
    else:
        s2_title = None
        s2_url = None
    
    # Compile all results
    compiled_rows.append({
        "paper_index": index,
        "doi": paper["doi"],
        "s2id": paper.get("s2id", None),
        "s2_url": s2_url,
        "csl_url": csl_url,
        "s2_title": s2_title,
        "csl_title": csl_title,
        "acknowledgements_text": None,
    })
    
compiled_dataset = pd.DataFrame(compiled_rows)
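Because each paper may have a title and URL from either source, it can help to collapse them into a single lookup pair before the manual copy-paste step. The following is a minimal sketch, not part of our original pipeline: the helper names `first_present` and `add_lookup_fields`, the added `title` and `url` fields, the example values, and the choice to prefer Semantic Scholar over CSL are all assumptions made for illustration.

```python
from typing import Optional


def first_present(*values: Optional[str]) -> Optional[str]:
    """Return the first value that is neither None nor empty, else None."""
    for value in values:
        if value:
            return value
    return None


def add_lookup_fields(row: dict) -> dict:
    """Copy a compiled row and add single 'title' and 'url' lookup fields.

    Assumption: Semantic Scholar values are preferred, with CSL as a fallback.
    """
    return {
        **row,
        "title": first_present(row.get("s2_title"), row.get("csl_title")),
        "url": first_present(row.get("s2_url"), row.get("csl_url")),
    }


# A made-up row in the same shape as the entries of compiled_rows
example_row = {
    "paper_index": 0,
    "doi": "10.0000/example",
    "s2id": None,
    "s2_url": None,
    "csl_url": "https://example.org/paper",
    "s2_title": None,
    "csl_title": "An Example Paper",
    "acknowledgements_text": None,
}

lookup_row = add_lookup_fields(example_row)
print(lookup_row["title"], lookup_row["url"])
```

Applied to every entry of `compiled_rows`, this would give one title and one link per paper to work from, while leaving the original per-source columns untouched.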

Our dataset after adding all the acknowledgements sections is available below:

Read and Show Data with Acknowledgements Sections Added
from itables import show
import itables.options as table_opts
table_opts.lengthMenu = [5, 10, 25, 50]

raw_data = pd.read_csv("data/raw-ack-sections.csv")
show(raw_data)
paper_index doi s2id s2_url csl_url s2_title csl_title acknowledgements_text

NER

We can now take each of these acknowledgements sections and run them through a named entity recognition model.

import spacy

nlp = spacy.load("en_core_web_trf")

# Filter dataset to only include rows with acknowledgements sections
filtered_data = raw_data.dropna(subset=["acknowledgements_text"])

# For each acknowledgement, run it through spacy,
# extract entities and their labels and store to a dataframe
entities_rows = []
docs = []
for _, paper in filtered_data.iterrows():
    doc = nlp(paper.acknowledgements_text)
    docs.append(doc)
    for ent in doc.ents:
        # Store with the DOI so we can join with other data later
        entities_rows.append({
            "doi": paper.doi,
            "entity": ent.text,
            "entity_label": ent.label_,
        })
        
entities = pd.DataFrame(entities_rows)

# How did the model tag each of these examples?
from ipywidgets import interact
from IPython.display import display, HTML
from spacy import displacy

@interact
def render_example(doc_index=list(range(len(docs)))):
    # jupyter=False makes displacy return the markup instead of displaying it itself
    return display(HTML(displacy.render(docs[doc_index], style="ent", jupyter=False)))

Here are the most common entity types:

import altair as alt

alt.Chart(entities).mark_bar().encode(
    alt.X("entity_label", sort="-y"),
    y="count()",
    color="entity_label",
    tooltip=["entity_label", "count()"],
).properties(
    width=400,
    height=300
).interactive()

The bulk of the named entities are people and organizations, which is what we would expect and what we are looking for, so we can filter out the rest.

# Filter all rows that aren't people or orgs
people_and_org_refs = entities.loc[entities.entity_label.isin(["PERSON", "ORG"])]

This is still too much data to visualize the count for every person or organization, so let’s visualize just the top ten referenced people or organizations.

# Count mentions per (entity, label) pair; value_counts sorts in descending order
top_ten_entities = people_and_org_refs.value_counts(
    subset=["entity", "entity_label"]
).reset_index(name="count").head(10)

alt.Chart(top_ten_entities).mark_bar().encode(
    alt.X("entity", sort="-y"),
    y="count",
    color="entity",
    tooltip=["entity", "entity_label", "count"],
).properties(
    width=400,
    height=300
).interactive()
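Counting raw entity strings can split one person or organization across several spellings. The sketch below shows one possible way to normalize mentions before counting. It is an illustration only and was not applied to the charts above: the helpers `normalize_entity` and `count_normalized`, the normalization rules (collapse whitespace, drop a leading "the", casefold), and the example mentions are all assumptions.

```python
import re
from collections import Counter
from typing import Iterable, Tuple


def normalize_entity(text: str) -> str:
    """Collapse whitespace, drop a leading 'the', and casefold the result."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    without_article = re.sub(r"^the\s+", "", collapsed, flags=re.IGNORECASE)
    return without_article.casefold()


def count_normalized(mentions: Iterable[Tuple[str, str]]) -> Counter:
    """Count (normalized entity, label) pairs from (text, label) mentions."""
    return Counter((normalize_entity(text), label) for text, label in mentions)


# Made-up mentions in the same shape as the (entity, entity_label) columns
example_mentions = [
    ("the National Science Foundation", "ORG"),
    ("National Science  Foundation", "ORG"),
    ("Jane Doe", "PERSON"),
]

counts = count_normalized(example_mentions)
print(counts.most_common(1))
# → [(('national science foundation', 'ORG'), 2)]
```

Rules this simple will not merge abbreviations with their full names, so any real use would need a manual review of the merged groups.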

Classifying Recognition

blah